Abstract

We introduce a class of neural networks derived from probabilistic models in
the form of Bayesian networks. By imposing additional assumptions about
the nature of the probabilistic models represented in the networks, we derive
neural networks with standard dynamics that require no training to determine
the synaptic weights, that perform accurate calculation of the mean
values of the random variables, that can pool multiple sources of evidence,
and that deal cleanly and consistently with inconsistent or contradictory
evidence. The presented neural networks capture many properties of Bayesian
networks, providing distributed versions of probabilistic models.

Keywords: Neural networks, probabilistic models, Bayesian networks, 
Bayesian inference, neural information processing, population coding 

1. Introduction 

Artificial neural networks are noted for their ability to learn functional
relationships from observed data. Unfortunately, a trained neural network
is typically a black box, so that it can be quite difficult to determine what
function is actually represented by the network. In numerous cases, neural
networks have been related to probabilistic models, with either the trained
network retrospectively given a probabilistic interpretation or the training
process itself explicitly based on a probabilistic strategy. Alternatively, a
constructive approach can be taken to exploring representation of probabilistic
models in neural networks, encoding pre-specified probabilistic models into
network weights. A key challenge in this alternative approach is to produce
reasonable neural networks, allowing a suitably broad class of probabilistic
models to be encoded into neural networks with recognizable architectures
and dynamics. Towards this end, in this paper we formulate and characterize
an encoding method that handles a restricted class of probabilistic models
and allows calculation, without training, of neural networks that accurately
process the mean values of the random variables with the usual neural
activation of a weighted sum of the neural inputs transformed with a nonlinear
activation function.

* Corresponding author. Email address: michael.barber@ait.ac.at (Michael J. Barber)
Preprint submitted to arXiv, April 30, 2010

Probabilistic formulations of neural information processing have been
explored along a number of avenues. One of the earliest such analyses showed
that the original Hopfield neural network implements, in effect, Bayesian
inference on analog quantities in terms of probability densities [1]. Zemel
et al. [2] have investigated population coding of probability distributions,
but with different representations and dynamics than those we consider in
this paper. Several extensions of this representation scheme have been
developed [3, 4, 5] that feature information propagation between interacting
neural populations. Additionally, several "stochastic machines" [6] have been
formulated, including Boltzmann machines [7], sigmoid belief networks [8], and
Helmholtz machines [9]. Stochastic machines are built of stochastic neurons
that occupy one of two possible states in a probabilistic manner. Learning
rules for stochastic machines enable such systems to model the underlying
probability distribution of a given data set.

Additionally, the connection between neural networks and probabilistic
models represented specifically as Bayesian networks [10, 11] has been
explored along two main lines. In one approach, the neural network architecture
and activation dynamics are specified, with a learning rule used that
attempts to capture the appropriate Bayesian network in the synaptic weights
based on observed patterns [8, 12]. In a second approach, a prespecified
Bayesian network is transformed into a neural network using an encoding
process [13, 14]. While specific Bayesian networks are readily given in the
latter approach, the neural architecture and dynamics arise from the encoding
and need not match existing definitions. In particular, instead of the
usual weighted sum of neural activation values passed through a nonlinear
activation function, the encoding process can produce neural networks
depending on multiplicative interactions between neural activities.

In this work, we further explore the latter, encoding-based approach. By
imposing additional, strong assumptions about the originating Bayesian
network, we develop neural networks processing mean values of analog variables,
where all weights are calculated and no learning process is needed. The
random variables are assumed to be normally distributed, which results in only
the mean values being accurately encoded into the neural network. The
resulting dynamics are of the usual form for neural networks, i.e., a weighted
sum of the neural inputs transformed with a nonlinear activation function.

We begin with a brief summary of the key relevant properties of Bayesian 
networks in Section 2. We describe a procedure for generating and evaluating 
the neural networks in Section 3, and apply the procedure to several examples 
in Section 4. 

2. Bayesian Networks 

Bayesian networks [10, 11] are directed acyclic graphs that represent
probabilistic models (Fig. 1). Each node represents a random variable, and the
arcs signify the presence of dependence between the linked variables. The
strengths of these influences are defined using conditional probabilities. We
additionally take the direction of a particular link to indicate the direction
of causality (or, more simply, relevance), with an arc pointing from cause to
effect; in this form, the Bayesian network is also called a causal network.

Multiple sources of evidence about the random variables are conveniently 
handled using Bayesian networks. The belief, or degree of confidence, in 
particular values of the random variables is determined as the likelihood of the 
value given evidentiary support provided to the network. There are two types 
of support that arise from the evidence: predictive support, which propagates 
from cause to effect along the direction of the arc, and retrospective support, 
which propagates from effect to cause, opposite to the direction of the arc. 

Figure 1: A Bayesian network. Evidence about any of the random variables
influences the likelihood of the remaining random variables. In a
straightforward terminology, the node at the tail of an arrow is a parent of
the child node at the head of the arrow; e.g., $X_4$ is a parent of $X_5$ and a
child of both $X_2$ and $X_3$. From the structure of the graph, we can see the
conditional independence relations in the probabilistic model. For example,
$X_5$ is independent of $X_1$ and $X_2$ given $X_3$ and $X_4$.

Bayesian networks have two properties that we will find very useful, both
of which stem from the dependence relations shown by the graph structure.
First, the value of a node X is not dependent upon all of the other graph
nodes. Rather, it depends only on a subset of the nodes, called a Markov
blanket of X, that separates node X from all the other nodes in the graph.
The Markov blanket of interest to us is readily determined from the graph
structure. It is the union of the direct parents of X, the direct
successors of X, and all direct parents of the direct successors of X. Second,
the joint probability over the random variables is decomposable as

$$P(x_1, x_2, \ldots, x_n) = \prod_{\mu=1}^{n} P(x_\mu \mid \mathrm{Pa}(x_\mu)) , \tag{1}$$

where $\mathrm{Pa}(x_\mu)$ denotes the (possibly empty) set of direct-parent
nodes of $X_\mu$. This decomposition comes about from repeated application of
Bayes' rule and from the structure of the graph.
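As an illustration of the first property, the Markov blanket can be read off a
list of parents mechanically. The sketch below assumes the Fig. 1 structure
implied by the figure caption and by the nonzero couplings listed in Table 1;
the names `parents` and `markov_blanket` are illustrative and not part of the
original work.

```python
# Direct parents of each node, assuming the Fig. 1 structure implied by the
# caption and the nonzero couplings in Table 1 (an assumption of this sketch).
parents = {1: set(), 2: set(), 3: {1, 2}, 4: {2, 3}, 5: {3, 4}}


def markov_blanket(node, parents):
    """Parents, children, and the children's other parents of `node`."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = set().union(*(parents[c] for c in children))
    return (parents[node] | children | co_parents) - {node}


blanket_of_x3 = markov_blanket(3, parents)
blanket_of_x5 = markov_blanket(5, parents)
```

For $X_5$, which has no children, the blanket reduces to its two parents, in
line with the conditional independence noted in the Fig. 1 caption.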

3. Neural Network Model 

We will develop neural networks from the set of marginal distributions
$\{p(x_\mu; t)\}$ so as to best match a desired probabilistic model
$p(x_1, x_2, \ldots, x_D)$ over the set of random variables, which are
organized as a Bayesian network. One or more of the variables $x_\mu$ must be
specified as evidence in the Bayesian network. To facilitate the development
of general update rules, we do not distinguish between evidence and
non-evidence nodes in our notation.

Our general approach will be to minimize the difference between a
probabilistic model $p(x_1, x_2, \ldots, x_D)$ and an estimate of the
probabilistic model $\hat{p}(x_1, x_2, \ldots, x_D)$. For the estimate, we
utilize

$$\hat{p}(x_1, x_2, \ldots, x_D) = \prod_\alpha p(x_\alpha; t) . \tag{2}$$

This is a so-called naive estimate, wherein the random variables are assumed
to be independent. We will place further constraints on the probabilistic
model and representation to produce neural networks with the desired
dynamics.

The first assumption we make is that the populations of neurons only need
to accurately encode the mean values of the random variables, rather than
the complete densities. We take the firing rates of the neurons representing
a given random variable $X_\alpha$ to be functions of the mean value
$\bar{x}_\alpha(t)$,

$$a_i^\alpha(t) = g\left(A_i^\alpha \bar{x}_\alpha(t) + B_i^\alpha\right) , \tag{3}$$

where $A_i^\alpha$ and $B_i^\alpha$ are parameters describing the response
properties of neuron $i$ of the population representing random variable
$X_\alpha$. The activation function $g$ is in general nonlinear; in this work,
we take $g$ to be the logistic function,

$$g(x) = \frac{1}{1 + \exp(-x)} . \tag{4}$$

We use a set of neural response functions (Fig. 2) similar to ones from work
on population-temporal coding that supported manipulation of mean values
[15, 16]. We can make use of Eq. 3 to directly encode mean values into neural
activation states, providing a means to specify the value of the evidence nodes
in the neural network.
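The encoding of Eq. 3 can be sketched in a few lines. The population
parameters used here (twenty neurons, a common gain of 4, thresholds spread
evenly over [-1, 1]) are hypothetical choices for illustration; the text does
not list the values behind Fig. 2.

```python
import numpy as np


def logistic(s):
    """Eq. 4: logistic activation function."""
    return 1.0 / (1.0 + np.exp(-s))


# Hypothetical response parameters for one population of 20 neurons.
N = 20
centers = np.linspace(-1.0, 1.0, N)   # assumed thresholds
A = np.full(N, 4.0)                   # assumed common gain A_i
B = -A * centers                      # offsets B_i placing each threshold


def encode(mean):
    """Eq. 3: firing rates of the population for a given mean value."""
    return logistic(A * mean + B)


evidence_rates = encode(0.5)
```

Clamping an evidence node then amounts to holding its population at
`encode(value)`.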

Using Eq. 3, we derive an update rule describing the neuronal dynamics,
obtaining (to first order in $\tau$)

$$a_i^\alpha(t + \tau) = g\left(A_i^\alpha \bar{x}_\alpha(t) + \tau A_i^\alpha \frac{d\bar{x}_\alpha(t)}{dt} + B_i^\alpha\right) . \tag{5}$$

Thus, if we can determine how $\bar{x}_\alpha$ changes with time, we can
directly determine how the neural activation states change with time.

The mean value $\bar{x}_\alpha(t)$ can be determined from the firing rates as
the expectation value of the random variable $X_\alpha$ with respect to a
density $p(x_\alpha; t)$ represented in terms of some decoding functions
$\{\phi_i^\alpha(x_\alpha)\}$. The density is recovered using the relation

$$p(x_\alpha; t) = \sum_i a_i^\alpha(t) \, \phi_i^\alpha(x_\alpha) . \tag{6}$$

The decoding functions are constructed so as to minimize the difference
between the assumed and reconstructed densities (discussed in detail in [17]).
Figure 2: The mean values of the random variables are encoded into the firing
rates of populations of neurons. A population of twenty neurons with sigmoidal
responses is associated with each random variable. The neuronal responses
$a_i^\alpha$ are fully determined by a single input $\xi$, which we interpret
as the mean value of a density. The form of the neuronal transfer functions
can be altered without affecting the general result presented in this work.
With representations as given in Eq. 6, we have

$$\bar{x}_\alpha(t) = \int x_\alpha \, p(x_\alpha; t) \, dx_\alpha = \sum_i a_i^\alpha(t) \, \bar{x}_i^\alpha , \tag{7}$$

where we have defined

$$\bar{x}_i^\alpha = \int x_\alpha \, \phi_i^\alpha(x_\alpha) \, dx_\alpha . \tag{8}$$

Although we used the decoding functions $\phi_i^\alpha(x_\alpha)$ to calculate
the parameters $\bar{x}_i^\alpha$, they can in practice be found directly so
that the relations in Eqs. 3 and 7 are mutually consistent.
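One way to find the decoding parameters directly is a least-squares fit that
makes decoding (Eq. 7) invert encoding (Eq. 3) over a working range of mean
values. The sketch below is an assumption about how this could be done, not
the procedure of the original work; the population parameters and the range
[-1, 1] are hypothetical.

```python
import numpy as np


def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))


# Hypothetical population: 20 sigmoidal neurons with shifted thresholds.
N = 20
centers = np.linspace(-1.0, 1.0, N)
A = np.full(N, 4.0)
B = -A * centers


def encode(mean):
    """Eq. 3: firing rates for a given mean value."""
    return logistic(A * mean + B)


# Fit xbar_i so that sum_i a_i(x) * xbar_i approximates x on the grid.
grid = np.linspace(-1.0, 1.0, 201)          # assumed working range
rates = logistic(np.outer(grid, A) + B)     # shape (201, N)
xbar_i, *_ = np.linalg.lstsq(rates, grid, rcond=None)


def decode(activities):
    """Eq. 7: mean value recovered from the firing rates."""
    return float(activities @ xbar_i)


roundtrip_error = abs(decode(encode(0.3)) - 0.3)
```

With parameters fitted this way, `decode(encode(x))` should return a value
close to `x` anywhere inside the fitted range.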

We take the densities $p(x_\alpha; t)$ to be normally distributed with the form
$p(x_\alpha; t) = p(x_\alpha; \bar{x}_\alpha(t)) = N(x_\alpha; \bar{x}_\alpha(t), \sigma_{x_\alpha}^2)$.
Intuitively, we might expect that the variance $\sigma_{x_\alpha}^2$ should be
small so that the mean value is coded precisely, but we will see that the
variances have no significance in the resulting neural networks.

The second assumption we make is that interactions between the nodes
are linear:

$$\bar{x}_\beta = \sum_\alpha K_{\beta\alpha} \bar{x}_\alpha . \tag{9}$$

Utilizing the causality relations given by the Bayesian network, we require
that $K_{\beta\alpha} \neq 0$ only if $X_\beta$ is a child node of $X_\alpha$
in the network graph. To represent the linear interactions as a probabilistic
model, we take the normal distributions
$p(x_\beta \mid \mathrm{Pa}(x_\beta)) = N(x_\beta; \sum_\alpha K_{\beta\alpha} x_\alpha, \sigma_\beta^2)$
for the conditional probabilities.

For nodes in the Bayesian network which have no parents, the conditional
probability $p(x_\beta \mid \mathrm{Pa}(x_\beta))$ is just the prior
probability distribution $p(x_\beta)$. We utilize the same rule to define the
prior probabilities as to define the conditional probabilities. For parentless
nodes, the prior is thus normally distributed with zero mean,
$p(x_\beta) = N(x_\beta; 0, \sigma_\beta^2)$.

We use the relative entropy [18] as a measure of the "distance" between
the joint distribution describing the probabilistic model
$p(x_1, x_2, \ldots, x_D)$ and the density estimated from the neural network
$\hat{p}(x_1, x_2, \ldots, x_D)$. Thus, we minimize

$$E = -\int \hat{p}(x_1, x_2, \ldots, x_D) \log\left( \frac{p(x_1, x_2, \ldots, x_D)}{\hat{p}(x_1, x_2, \ldots, x_D)} \right) dx_1 \, dx_2 \cdots dx_D \tag{10}$$
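with respect to the mean values $\bar{x}_\alpha$. By making use of the gradient
descent prescription

$$\frac{d\bar{x}_\gamma}{dt} = -\eta \frac{\partial E}{\partial \bar{x}_\gamma} \tag{11}$$

and the decomposition property for Bayesian networks given by Eq. 1, we
obtain the update rule for the mean values,

$$\frac{d\bar{x}_\gamma}{dt} = \frac{\eta}{\sigma_\gamma^2} \left( \sum_\beta K_{\gamma\beta} \bar{x}_\beta - \bar{x}_\gamma \right) - \eta \sum_\beta \frac{K_{\beta\gamma}}{\sigma_\beta^2} \left( \sum_\alpha K_{\beta\alpha} \bar{x}_\alpha - \bar{x}_\beta \right) . \tag{12}$$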
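The step from the relative entropy of Eq. 10 to the update rule of Eq. 12 can
be sketched as follows. This is a reconstruction under the normal and linear
assumptions stated above, not a derivation quoted from the original; $C$ is a
hypothetical symbol collecting every term that does not depend on the mean
values (the entropy of the naive estimate and the variance contributions).

```latex
% Relative entropy for normal conditionals and a product-of-normals estimate:
E = \sum_\beta \frac{1}{2\sigma_\beta^2}
      \Bigl( \bar{x}_\beta - \sum_\alpha K_{\beta\alpha}\,\bar{x}_\alpha \Bigr)^2 + C

% Gradient with respect to one mean value \bar{x}_\gamma:
\frac{\partial E}{\partial \bar{x}_\gamma}
  = \frac{1}{\sigma_\gamma^2}
      \Bigl( \bar{x}_\gamma - \sum_\beta K_{\gamma\beta}\,\bar{x}_\beta \Bigr)
  + \sum_\beta \frac{K_{\beta\gamma}}{\sigma_\beta^2}
      \Bigl( \sum_\alpha K_{\beta\alpha}\,\bar{x}_\alpha - \bar{x}_\beta \Bigr)
```

Multiplying the gradient by $-\eta$ gives the two terms of Eq. 12. Because the
variances of the represented densities enter only through $C$, they drop out
of the dynamics, consistent with the earlier remark that they have no
significance in the resulting networks.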

Because the coupling parameters $K_{\beta\alpha}$ are nonzero only when
$X_\alpha$ is a parent of $X_\beta$, generally only a subset of the mean values
contributes to updating $\bar{x}_\gamma$ in Eq. 12. In terms of the Bayesian
network graph structure, the only contributing values come from the parents of
$X_\gamma$, the children of $X_\gamma$, and the parents of the children of
$X_\gamma$; this is identical to the Markov blanket discussed in Section 2.
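The mean-value dynamics of Eq. 12 can be tried out directly, before any neural
representation is introduced. The sketch below uses the couplings listed in
Table 1 and clamps the evidence nodes; the unit variances, the step size, and
the number of steps are assumptions made for illustration.

```python
import numpy as np

# K[b, a] = K_{ba}; nonzero only if X_a is a parent of X_b (Table 1 values).
K = np.zeros((5, 5))
K[2, 0], K[2, 1] = -0.2163, -0.8328
K[3, 1], K[3, 2] = 0.0627, 0.1438
K[4, 2], K[4, 3] = -0.5732, 0.5955
sigma2 = np.ones(5)  # assumed: unit variance for every conditional


def mean_rate_of_change(x, eta=1.0):
    """Right-hand side of Eq. 12, evaluated for all nodes at once."""
    resid = (K @ x - x) / sigma2          # (sum_a K_ba x_a - x_b) / sigma_b^2
    return eta * (resid - K.T @ resid)


def relax(evidence, tau=0.1, steps=2000):
    """Euler-integrate Eq. 12 with the evidence nodes held fixed."""
    x = np.zeros(5)
    for node, value in evidence.items():
        x[node] = value
    free = [n for n in range(5) if n not in evidence]
    for _ in range(steps):
        x[free] += tau * mean_rate_of_change(x)[free]
    return x


# Evidence on X_1 and X_2 (zero-based indices 0 and 1).
means = relax({0: 0.5, 1: -0.5})
```

With this evidence, the free means should settle at the values given by the
linear relations of Eq. 9, i.e., the "Direct Calculation" column of Table 1.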

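The update rule for the neural activities is obtained by combining Eqs. 5,
7 and 12, resulting in

$$a_i^\gamma(t + \tau) = g\left( \sum_j S_{ij}^\gamma a_j^\gamma(t) + B_i^\gamma + \tau \eta \, h_i^\gamma(t) \right) . \tag{13}$$

The quantity $\sum_j S_{ij}^\gamma a_j^\gamma(t) + B_i^\gamma$ serves to
stabilize the activities of the neurons representing $p(x_\gamma)$, while

$$h_i^\gamma(t) = \sum_j T_{ij}^\gamma a_j^\gamma(t) + \sum_\beta \sum_j \left( U_{ij}^{\gamma\beta} + V_{ij}^{\gamma\beta} \right) a_j^\beta(t) + \sum_{\alpha,\beta} \sum_j W_{ij}^{\gamma\beta\alpha} a_j^\alpha(t) \tag{14}$$

drives changes in $a_i^\gamma(t)$ based on the densities represented by other
nodes of the Bayesian network. The synaptic weights of the neural network are

$$S_{ij}^\gamma = A_i^\gamma \bar{x}_j^\gamma , \tag{15}$$

$$T_{ij}^\gamma = -A_i^\gamma \frac{1}{\sigma_\gamma^2} \bar{x}_j^\gamma , \tag{16}$$

$$U_{ij}^{\gamma\beta} = A_i^\gamma \frac{1}{\sigma_\gamma^2} K_{\gamma\beta} \bar{x}_j^\beta , \tag{17}$$

$$V_{ij}^{\gamma\beta} = A_i^\gamma \frac{1}{\sigma_\beta^2} K_{\beta\gamma} \bar{x}_j^\beta , \tag{18}$$

$$W_{ij}^{\gamma\beta\alpha} = -A_i^\gamma \frac{1}{\sigma_\beta^2} K_{\beta\gamma} K_{\beta\alpha} \bar{x}_j^\alpha . \tag{19}$$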
The foregoing provides an algorithm for generating and evaluating neural
networks that process mean values of random variables. To summarize,

1. Establish independence relations between model variables. This may
be accomplished by using a graph to organize the variables.

2. Specify the $K_{\beta\alpha}$ to quantify the relations between the variables.

3. Assign network inputs by encoding desired values into neural activities
using Eq. 3.

4. Update other neural activities using the update rule in Eq. 13 and the
supporting definitions in Eqs. 14 to 19.

5. Extract the expectation values of the variables from the neural
activities using Eq. 7.
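The five steps above can be sketched end to end. This is a minimal
illustration under stated assumptions, not the author's implementation: every
population shares the same hypothetical parameters `A`, `B` and decoding
parameters `xbar_i` (fitted by least squares), the variances are set to one,
and the product of $\tau$ and $\eta$ is fixed at 0.5. Under the shared-parameter
assumption, each weight matrix of Eqs. 16 to 19 is a scalar multiple of $S$
from Eq. 15, so the scalars are collected in a matrix `C`.

```python
import numpy as np


def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))


# Steps 1-2: graph structure and couplings, K[b, a] = K_{ba} (Table 1 values).
D = 5
K = np.zeros((D, D))
K[2, 0], K[2, 1] = -0.2163, -0.8328
K[3, 1], K[3, 2] = 0.0627, 0.1438
K[4, 2], K[4, 3] = -0.5732, 0.5955
sigma2 = np.ones(D)                              # assumed unit variances

# Hypothetical population parameters, shared by all variables.
N = 20
centers = np.linspace(-1.0, 1.0, N)
A = np.full(N, 4.0)
B = -A * centers
grid = np.linspace(-1.0, 1.0, 201)
xbar_i, *_ = np.linalg.lstsq(logistic(np.outer(grid, A) + B), grid, rcond=None)

# Eq. 15: S[i, j] = A_i * xbar_j. Scalar prefactors of Eqs. 16-19:
# C[g, a] = -delta_ga/s_g + K_ga/s_g + K_ag/s_a - sum_b K_bg K_ba / s_b.
S = np.outer(A, xbar_i)
M = K / sigma2[:, None]
C = -np.diag(1.0 / sigma2) + M + M.T - K.T @ M


def run(evidence, steps=400, tau_eta=0.5):
    """Steps 3-5: clamp evidence, iterate Eq. 13, decode with Eq. 7."""
    acts = np.tile(logistic(B), (D, 1))          # every mean starts at 0
    for node, value in evidence.items():
        acts[node] = logistic(A * value + B)     # Eq. 3
    free = [g for g in range(D) if g not in evidence]
    for _ in range(steps):
        drive = acts @ S.T                       # drive[a, i] = sum_j S_ij a_j
        new = logistic(drive + B + tau_eta * (C @ drive))   # Eqs. 13 and 14
        acts[free] = new[free]
    return acts @ xbar_i                         # Eq. 7


decoded = run({0: 0.5, 1: -0.5})
```

With evidence on $X_1$ and $X_2$, the decoded means of the remaining nodes
should land close to the values in Table 1; the residual mismatch reflects how
well the fitted decoding inverts the encoding.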

4. Examples 

As a first example, we apply the algorithm to the Bayesian network shown
in Fig. 1, with firing rate profiles as shown in Fig. 2. Specifying
$\bar{x}_1 = 1/2$ and $\bar{x}_2 = -1/2$ as evidence, we find an excellent
match between the mean values calculated by the neural network and the
directly calculated values for the remaining nodes (Table 1).

We next focus on some simpler Bayesian networks to highlight certain
properties of the resulting neural networks (which will again utilize the
firing rate profiles shown in Fig. 2). In Fig. 3, we present two Bayesian
networks that relate three random variables in different ways. The connection
strengths are all taken to be unity in each graph, so that
$K_{21} = K_{23} = K_{12} = K_{32} = 1$.



Table 1: The mean values decoded from the neural network closely match the
values directly calculated from the linear relations. The coefficients for the
linear combinations were randomly selected, with values $K_{31} = -0.2163$,
$K_{32} = -0.8328$, $K_{42} = 0.0627$, $K_{43} = 0.1438$, $K_{53} = -0.5732$,
and $K_{54} = 0.5955$.

Node    Direct Calculation    Neural Network
X_1          0.5000               0.5000
X_2         -0.5000              -0.5000
X_3          0.3083               0.3084
X_4          0.0130               0.0128
X_5         -0.1690              -0.1689
Figure 3: Simpler Bayesian networks, shown in panels (a) and (b). Although the
underlying undirected graph structure is identical for these two networks, the
directions of the causality relationships between the variables are reversed.
The neural networks arising from the Bayesian networks thus have different
properties.

With the connection strengths so chosen, the two Bayesian networks have
straightforward interpretations. For the graph shown in Fig. 3(a), $X_2$
represents the sum of $X_1$ and $X_3$, while, for the graph shown in Fig. 3(b),
$X_2$ provides a value which is duplicated in $X_1$ and $X_3$. The different
graph structures yield different neural networks; in particular, nodes $X_1$
and $X_3$ have direct synaptic connections in the neural network based on the
graph in Fig. 3(a), but no such direct weights exist in a second network based
on Fig. 3(b). Thus, specifying $\bar{x}_1 = -1/4$ and $\bar{x}_2 = 1/4$ for the
first network produces the expected result $\bar{x}_3 = -0.5000$, but
specifying $\bar{x}_2 = 1/4$ in the second network produces
$\bar{x}_3 = 0.2500$ regardless of the value (if any) assigned to $\bar{x}_1$.

To further illustrate the neural network properties, we use the graph
shown in Fig. 3(b) to process inconsistent evidence. Nodes $X_1$ and $X_3$
should copy the value in node $X_2$, but we can specify any values we like as
network inputs. For example, when we assign $\bar{x}_1 = -1/4$ and
$\bar{x}_3 = 1/2$, the neural network yields $\bar{x}_2 = 0.1250$ for the
remaining value. This is a typical and reasonable result, matching the
least-squares solution to the inconsistent problem.
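The least-squares value quoted above is easy to check independently of the
neural network. The sketch below treats the two copy relations of Fig. 3(b),
$x_1 = x_2$ and $x_3 = x_2$, as an overdetermined linear system in the single
unknown $x_2$; the variable names are illustrative.

```python
import numpy as np

# Two equations, one unknown: x_2 should equal both x_1 and x_3.
design = np.array([[1.0], [1.0]])
targets = np.array([-0.25, 0.5])      # the inconsistent evidence x_1, x_3

solution, *_ = np.linalg.lstsq(design, targets, rcond=None)
x2_least_squares = float(solution[0])
```

The solution is the average of the two conflicting inputs, 0.125, which agrees
with the value reported for the neural network.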

5. Conclusion 

We have introduced a class of neural networks that consistently mix
multiple sources of evidence. The networks are based on probabilistic models,
represented in the graphical form of Bayesian networks, and function based
on traditional neural network dynamics (i.e., a weighted sum of neural
activation values passed through a nonlinear activation function). We
constructed the networks by restricting the represented probabilistic models
through two auxiliary assumptions.

First, we assumed that only the mean values of the random variables need
to be accurately represented, with higher order moments of the distribution
being unimportant. We introduced neural representations of relevant
probability density functions consistent with this assumption. Second, we
assumed that the random variables of the probabilistic model are linearly
related to one another, and chose appropriate conditional probabilities to
implement these linear relationships.

Using the representations suggested by our auxiliary assumptions, we 
derived a set of update rules by minimizing the relative entropy of an assumed 
density with respect to the density decoded from the neural network. In a 
straightforward fashion, the optimization procedure yields neural weights and 
dynamics that implement specified probabilistic relations, without the need 
for a training process. 

The neural networks investigated in this work capture many of the
properties of both Bayesian networks and traditional neural network models. In
particular, multiple sources of evidence are consistently pooled based on local
update rules, providing a distributed version of a probabilistic model.